@@ -31,7 +31,7 @@ module Agents
 
 # Scraping HTML and XML
 
-When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` on each node in the node set, converting the result into string. Here's an example:
+When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into string. Here's an example:
 
     "extract": {
       "url": { "css": "#comic img", "value": "@src" },
@@ -39,7 +39,7 @@ module Agents
       "body_text": { "css": "div.main", "value": ".//text()" }
     }
 
-"@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use ".".
+"@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use ".".
 
 You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
 
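A reading aid for the documentation change above: because `value` now defaults to `.`, and the description says `.` yields the outer HTML, an extraction that omits `value` should behave like an outer-HTML extraction. The sketch below only resolves which XPath expression would be evaluated per extraction key; it does not parse any document. The helper `value_xpaths`, the constant `DEFAULT_VALUE_XPATH`, and the `comic` key are hypothetical names used for illustration and are not part of the patch.

```ruby
# Hypothetical helper (not in the patch): maps each extraction key to the
# XPath expression that would be evaluated on its selected nodes.
DEFAULT_VALUE_XPATH = '.' # the default introduced by this change

def value_xpaths(extract)
  # Fall back to the default when an extraction has no "value" entry.
  extract.transform_values { |details| details['value'] || DEFAULT_VALUE_XPATH }
end

extract = {
  'url'       => { 'css' => '#comic img', 'value' => '@src' },
  'body_text' => { 'css' => 'div.main',   'value' => './/text()' },
  'comic'     => { 'css' => '#comic' } # hypothetical: no "value" given
}

value_xpaths(extract).each do |key, xpath|
  puts "#{key}: #{xpath}"
end
```

Under this reading, `url` and `body_text` keep their explicit expressions, while `comic` falls back to `.`.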
@@ -373,7 +373,7 @@ module Agents
 case nodes
 when Nokogiri::XML::NodeSet
   result = nodes.map { |node|
-    case value = node.xpath(extraction_details['value'])
+    case value = node.xpath(extraction_details['value'] || '.')
     when Float
       # Node#xpath() returns any numeric value as float;
       # convert it to integer as appropriate.
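The hunk ends before the conversion that the comment describes, so the following is only a guess at what "as appropriate" could mean: a float with no fractional part is turned into an integer before being stringified, and everything else is stringified as-is. The method name `xpath_result_to_s` is hypothetical, and the `finite?` guard is an added assumption to avoid calling `to_i` on `NaN` or infinity.

```ruby
# Hypothetical sketch, not the code from the patch: stringify an XPath
# result, dropping a trailing ".0" from whole-number floats.
def xpath_result_to_s(value)
  case value
  when Float
    # Only convert when no information is lost (3.0 -> 3, but not 3.5).
    value = value.to_i if value.finite? && value == value.to_i
  end
  value.to_s
end

puts xpath_result_to_s(3.0)
puts xpath_result_to_s(3.5)
puts xpath_result_to_s('text')
```

Without such a step, a numeric XPath result like `count(.//a)` would presumably be rendered with a trailing `.0`.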